Person Name Identification in Chinese Documents Using Finite State Automata
Abstract
This research addresses the automatic identification and extraction of person names in Chinese text documents. Solutions to this problem have immediate and extensive applications in many areas, especially in applications related to Web intelligent agents, such as Web search engines, Web data mining, and automatic Web information analysis. While finite state automata (FSA) based techniques have been used extensively in NLP and IE for English, they have not yet been widely used in processing Chinese text; in particular, to our knowledge, no work has been reported on using FSA for person name identification and extraction. Motivated by this need, we propose an FSA-based person name identification method called NICF. Evaluations show that NICF performs very well in terms of identification recall and accuracy as well as processing speed, and thus holds great promise for future applications.
Related resources
Reduction of Computational Complexity in Finite State Automata Explosion of Networked System Diagnosis (RESEARCH NOTE)
This research puts forward rough finite state automata, represented by two variants of BDD called ROBDD and ZBDD. The proposed structures have been used in networked system diagnosis and can overcome combinatorial explosion. The implementation uses the CUDD (Colorado University Decision Diagrams) package. A mathematical proof of the claimed complexity is provided, which shows ZBDD ...
Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text
There has been little prior work on Named Entity Recognition for "informal" documents like email. We present two methods for improving performance of person name recognizers for email: email-specific structural features and a recall-enhancing method which exploits name repetition across multiple documents.
Indirect Spatial Data Extraction from Web Documents
An approach for indirect spatial data extraction by learning restricted finite state automata from web documents written in Bulgarian is outlined in the paper. It uses heuristics to generalize initial finite-state automata that recognize only the positive examples and nothing else into automata that recognize as large a language as possible without extracting any non-positive exam...
Combine Person Name and Person Identity Recognition and Document Clustering for Chinese Person Name Disambiguation
This paper presents the HITSZ_CITYU system in the CIPS-SIGHAN bakeoff 2010 Task 3, Chinese person name disambiguation. This system incorporates person name string recognition, person identity string recognition, and agglomerative hierarchical clustering to group the documents belonging to each person. Firstly, for the given name index string, three segmentors are applied to segment the se...
Succinctness of two-way probabilistic and quantum finite automata
Publication date: 2003